The Graduate School HIGH PERFORMANCE RECORD LINKAGE
Authors
Abstract
Today, the sheer size of data sets makes it difficult to find similar or identical records. In addition, dirty data, i.e., typos, missing or distorted information, and additional noise typically introduced by careless editing or entry mistakes, makes it even harder to determine which entity a record belongs to. We therefore focus on faster detection of records that refer to the same real-world entity in large data sets under error-prone conditions, while maintaining high detection accuracy. In this thesis, we study high-performance linkage algorithms through four different applications. First, we introduce an image linkage algorithm that finds near-duplicate images with similar characteristics by bridging two seemingly unrelated fields: Multimedia Information Retrieval and Biology. Under this idea, we study how various image features and gene sequence generation methods affect the accuracy and performance of near-duplicate image detection. Second, we develop a video linkage algorithm that uses record linkage methods to detect copied videos in large multimedia databases or on sites such as YouTube and Yahoo Videos. The hierarchical structure of the proposed algorithms reflects the characteristics of video, and pipelined linkage structures speed up detection further. Third, we introduce a parallel linkage algorithm, a parallelization of the data linkage framework, for cases where sequential linkage frameworks that apply iterative matching operations to clean and merge dirty data sets are optimal but slow. Any data matching function can be adapted to the proposed parallel framework because the parallel scheme treats the data linkage function as a black box. Finally, we introduce a hashed linkage structure based on the locality sensitive hashing (LSH) algorithm. By remedying the shortcomings of the basic LSH structure for linkage problems, the proposed hashing structure greatly reduces processing time compared to conventional LSH structures.
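To make the idea of a hashed linkage structure concrete, the sketch below shows a textbook MinHash LSH with banding used to generate candidate record pairs, followed by an exact Jaccard check. It is not the improved structure proposed in the thesis, whose details the abstract does not give. All function names, the character 3-gram representation, and the parameters (`num_hashes`, `bands`, `threshold`) are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a whitespace/case-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens)
                 for seed in range(num_hashes))


def candidate_pairs(records, num_hashes=32, bands=16):
    """Bucket records by signature bands; records sharing a bucket are candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))  # ids are in increasing order
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def link(records, threshold=0.5, **lsh_args):
    """Verify LSH candidates with exact Jaccard and return the linked index pairs."""
    sh = [shingles(r) for r in records]
    return sorted((i, j) for i, j in candidate_pairs(records, **lsh_args)
                  if jaccard(sh[i], sh[j]) >= threshold)


if __name__ == "__main__":
    # Hypothetical records: the first two differ only in case and spacing.
    data = ["Jonathan Smith, 12 Main Street",
            "jonathan  smith, 12 main street",
            "Maria Gonzalez, 98 Ocean Avenue"]
    print(link(data))
```

Only records that collide in at least one band are compared exactly, which is what avoids the all-pairs comparison on large data sets. Increasing `bands` for a fixed `num_hashes` makes collisions more likely, trading extra verification work for higher recall.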
Similar resources
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When comprehensive information about a topic is scattered across two or more data sets, using only one of them loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates within a data set. The i...
A comprehensive meta-analysis of the predictive validity of the graduate record examinations: implications for graduate student selection and performance.
This meta-analysis examined the validity of the Graduate Record Examinations (GRE) and undergraduate grade point average (UGPA) as predictors of graduate school performance. The study included samples from multiple disciplines, considered different criterion measures, and corrected for statistical artifacts. Data from 1,753 independent samples were included in the meta-analysis, yielding 6,589 ...
The Relationship Between Job Satisfaction and Job Performance Among Midwives Working in Healthcare Centers of Mashhad, Iran
Background and Aim: Job satisfaction represents individuals' positive or negative attitude towards their occupation. Job satisfaction is of high significance in the health care field and could affect the quality of patients' health care and satisfaction. Every organization should pay considerable attention to job satisfaction and performance and continually monitor these indices. Therefore, we aim...
Factors Influencing Self-Rated Preparedness for Graduate School: A Survey of Graduate Students
Numerous studies have found a host of factors that are likely to result in more successful applications to graduate schools. This study was a retrospective examination of the variables that distinguish graduate students who believed they were better prepared for graduate school. We examined several of these factors, including variables associated with undergraduate education and the individual ...
High-Performance Computing Techniques for Record Linkage
Record linkage is the task of linking together information from one or more data sources representing the same entity (patient, customer, provider, business, etc.). If no unique identifier is available, probabilistic linkage techniques have to be applied. Applications of record linkage: remove duplicates in a data set (internal linkage); merge new records into a larger master data set; create patient oriented statist...
Publication date: 2010